


Institutional Archive of the Naval Postgraduate School 


Calhoun: The NPS Institutional Archive 
DSpace Repository 


Theses and Dissertations 1. Thesis and Dissertation Collection, all items 


1972 


An investigation into machine recognition of 
vowel-like sounds and their allophones in one
syllable words 


Hydinger, John Paul; Stubbs, Frederick Michael. 


http://ndl.handle.net/10945/16201 


This publication is a work of the U.S. Government as defined in Title 17, United 
States Code, Section 101. Copyright protection is not available for this work in the 
United States. 


Downloaded from NPS Archive: Calhoun 


Calhoun is the Naval Postgraduate School's public access digital repository for
research materials and institutional publications created by the NPS community.
Calhoun is named for Professor of Mathematics Guy K. Calhoun, NPS's first
appointed, and published, scholarly author.

Dudley Knox Library / Naval Postgraduate School

411 Dyer Road / 1 University Circle 
Monterey, California USA 93943 





http://www.nps.edu/library 










NAVAL POSTGRADUATE SCHOOL
Monterey, California





An Investigation Into Machine Recognition 
of
Vowel-like Sounds and Their Allophones 
in 
One Syllable Words 


by 


John Paul Hydinger and Frederick Michael Stubbs 


Thesis Advisor: Robert C. Bolles 


June 1972 


Approved for public release; distribution unlimited.





An Investigation Into Machine Recognition 
of 
Vowel-like Sounds and Their Allophones 
in 
One Syllable Words 




by 


John Paul Hydinger 
Lieutenant, United States Navy 
B. S., United States Naval Academy, 1968 


and 


Frederick Michael Stubbs 
Lieutenant, United States Navy 
B. S., University of Utah, 1965 


Submitted in partial fulfillment of the 
requirements for the degree of 


MASTER OF SCIENCE IN COMPUTER SCIENCE 
from the 


NAVAL POSTGRADUATE SCHOOL 
June, 1972 











ABSTRACT 


The goal was to recognize sustained vowel-like sounds and their 
allophones in one syllable words. A bank of filters and a digital 
sampler provided a data base for a polynomial curve fitting routine.

The frequency range under investigation was 500-1000 Hz. A COMCOR CI 
5000 analog computer and an XDS 9300 digital computer were used. 
Although coefficient correlation was ineffective, several recommendations
for system improvement are made.





TABLE OF CONTENTS


I.   INTRODUCTION
II.  BACKGROUND INFORMATION
     A. TRADITIONAL APPROACHES
     B. GOAL
     C. PRELIMINARY ASSUMPTIONS
     D. VOCABULARY
     E. SYSTEM OVERVIEW
III. INITIAL EXPERIMENTATION
IV.  ANALOG SYSTEM
     A. COMPARATIVE NETWORK
     B. BAND-PASS FILTER
     C. SMOOTHING CIRCUIT
     D. SAMPLING FREQUENCY
V.   DIGITAL PROGRAM DEVELOPMENT
     A. INPUT DATA AVERAGING
     B. CHANGING SAMPLING FREQUENCY
     C. RAW VERSUS SMOOTHED DATA
     D. NOISE
     E. NORMALIZATION
     F. VARIABLE WEIGHTING FUNCTION
     G. TIME SCALING
     H. SECOND DEGREE SMOOTHING
     I. DEGREE OF POLYNOMIAL FIT
VI.  SUMMARY
VII. RECOMMENDATIONS
     A. CURVE AVERAGING
        1. [illegible]
        2. [illegible]
     B. HARDWARE CHANGES
     C. SECOND DEGREE ANALOG SMOOTHING
     D. [illegible] CORRELATION
     E. ORTHOGONAL COEFFICIENT CORRELATION
APPENDIX A - FIGURE 1 THRU FIGURE 5
APPENDIX B - TABLE I AND TABLE II
[illegible]
LIST OF REFERENCES
[illegible]





I. INTRODUCTION 


Attempts at speech recognition use either special purpose hardware 
or computers. In both cases, filter banks are often used. The majority 
of the work in the field has been formant and frequency analysis. 

The goal was to achieve a recognition algorithm for sustained
vowel-like sounds and their allophones in one syllable words. It was assumed
that a voiced audio signal could be broken into eight frequency bands 
ranging from 500 to 1000 Hz and the respective audio curves fitted to 
polynomials. It was further assumed that similar curves have similar 
coefficients. 

A hybrid system, consisting of a COMCOR CI 5000 analog computer and
a Xerox Data Systems 9300 digital computer, is used to effect a speech 
recognizer. Figure 1 is a diagram of the system. 

Two experiments were conducted prior to system implementation.
Experimentation using various frequency ranges was attempted in order to 
resolve a frequency conflict. In the first experiment subjects listened 
to random words, whereas in the second experiment brush recordings of 
the same words were studied. 

The heart of the analog system is a parallel bank of eight band-pass 
filters. Their output is smoothed, sampled, and sent to the XDS 9300 
for analysis. Figure 2 is a diagram of the complete analog system. 

A digital program performs a fifteenth degree polynomial fit on each
of the eight audio curves that are sampled from the analog computer.
The program then outputs eight sets of normalized coefficients for
elementary analysis. System noise is eliminated digitally, and zero data
points occurring at the end of words are completely overlooked by the
polynomial fitting routine. Several program modifications were
incorporated and their results discussed.





II. BACKGROUND INFORMATION 


A. TRADITIONAL APPROACHES 

In the investigation of speech recognition by the direct analysis 
of a speech wave (Reddy, 1966), the goal was to produce a phonemic 
transcription of a connected utterance which is readable and bears a 
satisfactory resemblance to what was said. The problem was confined to 
a single cooperative speaker so that writing, adjusting and testing 
programs would be easier. It was felt that a "tune-in" process would 
adapt the program to a wider variety of speakers. No attempt was 
made to group the phonemes into words or higher level linguistic units.

The concepts which were considered, such as amplitude normalization 
and time normalization, show some insight. In the case of sustained 
sounds and one syllable words, though, time normalization may not be 
necessary. It does not seem realistic, however, that the "tune-in"
process could overcome the lack of generality in the original program. 

A procedure for segmenting connected speech (Reddy and Vicens, 1968) 
performs smoothing and differencing operations on the digitized acoustic 
waveform to generate parameters which are used to determine whether the
characteristics of a sound are changing or similar. Parts that possess 
similar parameters are grouped together to form sustained segments, 
resulting in the segmentation of connected speech into parts approximately 
corresponding to phonemes. 

Smoothing looks like a reasonable operation to perform on waveforms
before they are compared. A question that arises, though, is whether
the smoothing should be done in the analog circuit or after the information
has been digitized. Perhaps, too, one smoothing operation is not enough.





A successful limited speech recognition system (Bobrow and Klatt, 1968) 
operates within limitations along a number of dimensions. Rather than 
use continuous speech in which segmentation is a problem, the approach 
is to work with messages with easily delimited beginning and termination 
points. The set of messages is limited in number; at any one time the 
vocabulary to be distinguished can contain up to about 100 items. 
However, an item need not be a single word, but may be any short phrase. 
The system is useable by any male speaker, but must first be trained by 
him. The system, LISPER, is not designed to work well simultaneously 
for a number of different speakers, or achieve good recognition scores for 
an unknown speaker. The training period consists of a period of closed 
loop operation in which the speaker says an input message, the system 
guesses what he says, and he responds with the correct message. The 
recognition algorithm is a program that learns to identify words by 
associating the outputs of various property extractors with them. Each 
property has a corresponding feature state which may imply that the property 
is irrelevant for the current time interval, the property is relevant 
but not present, or the property is both relevant and present. 

Several advantages of this approach are: 

1. A precise segmentation of the utterance is not required.

2. The utterance need not be a single word. 

3. Features may be added to the system to provide desirable redundancy. 

4. The feature approach permits the introduction and testing of
linguistic hypotheses.

Two main disadvantages are:

1. The current implementation is not speaker independent. 

2. The system will degrade in performance as the length of the 
vocabulary is increased or as the number of speakers that it can 
simultaneously recognize is increased. 

The differential effects upon vowel intelligibility of various degrees 
of time compression and frequency division were examined both with and 
without time restoration (Daniloff, Shriner and Zemlin, 1968). A male 
speaker and a female speaker were used. For a given percentage of
distortion, frequency division degrades vowel intelligibility more severely
than time compression. Restoring time to normal for frequency-division
speech does not enhance intelligibility. Vowel confusions under time 
compression are related to duration; those for frequency division 
conditions appear to be closely related to the perception of Vowel 
Formant Two, and to a lesser degree, Vowel Formant One. Patterns of 
male and female vowel confusions are generally much alike for all 
conditions and types of distortion. Results tentatively indicate 
superior female vowel intelligibility under all conditions of distortion, 
the advantage being largest for frequency division and somewhat less 
for time compression. These results suggest that over a limited range 
of frequency division up to forty percent, vowel phonemic quality is 
relatively unaffected by proportionate shifting of fundamental frequency 
and formant structure, indicating that a "relative-vowel" hypothesis 
of vowel phonemic quality may hold for limited shifts in the frequency of
vowel spectra.





The idea that vowel phonemic quality may hold during normalization 
is extremely important. However, the statement that vowel confusions 
under time compression are related to duration conflicts with another 
study (Seo, 1968). That method yields time compressed speech which is
of normal pitch, and highly intelligible. It utilizes a systematic 
approach in which portions of phonemes are sectioned out without 
destroying cognitive qualities. 

Another process for the extraction of significant parameters of speech 
involves division of the speech spectrum into convenient frequency bands, 
and calculation of amplitude and zero-crossing parameters in each of these 
bands every ten milliseconds (Vicens, 1969). In the software implementation, 
a smoothing function divides the speech spectrum into two frequency 
bands (above and below 1000 Hz). In the hardware implementation, the 
spectrum is divided into three bands using bandpass filters (150-900 Hz, 
900-2200 Hz, and 2200-5000 Hz). 

As in many other approaches, considerable effort is spent
investigating from one-fourth to one-half the range of human hearing. Although
this may be the correct approach to take, the experiments discussed in
the next section would seem to indicate otherwise.

In an interview at Stanford Research Institute (Walker, 1972) it was
suggested that, rather than concentrate solely on sustained sound, it
might be worthwhile to look at the dynamics of sounds. It was further
suggested that the upper limit of the frequency range to be investigated
be increased to 10 kHz.

An earlier conversation with some of the technical people at
Pacific Telephone revealed that a frequency range of 500-1000 Hz would
result in a highly intelligible sound to a human listener. If this is
the case, either:

1. The intelligibility is context dependent.

2. A significant speech parameter is being overlooked by the people
who are investigating the frequencies above 1000 Hz, feeling that such
investigation is necessary to ensure adequate information.

In particular, a considerable amount of time is spent looking for
significant vowel information between 3000 and 4000 Hz. Section III will
discuss this conflict in more detail.


B. GOAL 

The initial goal was to attempt to program a hybrid system to
recognize phonemes, or basic sustained sounds, with particular emphasis on the
differences of similar sounds. The sustained sound, however, is static
and therefore unrealistic in nature. The goal was then modified so
that the investigation would include some sustained vowel-like sounds,
then some one syllable words containing those sounds, and finally an
attempt to break down the word to study the dynamics of the vowel-like
sound.


C. PRELIMINARY ASSUMPTIONS 

The original premise was that the voiced sound could be broken into 
different frequency ranges, and that a subroutine could be used that 
would perform a polynomial fit to each of the filtered audio signals. 
The coefficients from these fits would then be used as a data base for 
phoneme recognition. This implies that similar curves will have similar 
coefficients. A comparison of the coefficients from two sets of data 
that are supposed to represent the same sound leads to the theory that a
unique correlation exists in some subset of those coefficients.

Correlation implies that some subset of coefficients of a sound is a
multiple, plus or minus some error tolerance, of the same subset of
coefficients of the same sound said at another time. This subset will
be referred to from now on as the "characteristic subset" of a sound.
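
The correlation premise can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis program: the function name `characteristic_subset`, the five percent relative tolerance, and the coefficient values are all hypothetical.

```python
def characteristic_subset(coeffs_a, coeffs_b, indices, tolerance=0.05):
    """Return True if coeffs_b[i] is roughly k * coeffs_a[i] for a single
    common multiple k over every i in indices (hypothetical helper)."""
    ratios = []
    for i in indices:
        if coeffs_a[i] == 0:
            return False  # a ratio is undefined for a zero coefficient
        ratios.append(coeffs_b[i] / coeffs_a[i])
    k = sum(ratios) / len(ratios)  # candidate common multiple
    if k == 0:
        return False
    return all(abs(r - k) <= tolerance * abs(k) for r in ratios)


# Made-up coefficient sets: the first three entries of `second` are
# about twice those of `first`; the fourth does not follow the pattern.
first = [1.0, -2.0, 0.5, 4.0]
second = [2.02, -3.96, 1.01, 3.0]
print(characteristic_subset(first, second, [0, 1, 2]))
print(characteristic_subset(first, second, [0, 1, 2, 3]))
```

The choice of a relative rather than an absolute tolerance is an assumption; the text says only "plus or minus some error tolerance."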


D. VOCABULARY 

Any sound that is not a single vowel-like sound or a one syllable
English word containing that vowel sound is outside the domain of
discussion. A vowel-like sound excludes some vowel pronunciations, such as
[illegible]; it includes diphthongs such as ou in the word though. However, ou is
excluded in words such as out.


E. SYSTEM OVERVIEW 

There are three phases to speech recognition: 

1. Manipulate and sample an analog signal. 

2. Digitally analyze the samples obtained from the analog computer. 

3. Apply a recognition algorithm to the results of the digital 
analysis. 

In this research, an audio signal is filtered into eight pass bands 
after a comparator is keyed by the excitation voltage. The output from 
the filters is smoothed prior to the digital sampler. At the point of 
smoothing, the envelopes of the filtered signals may be looked at on the 
brush recorder. The digitized samples are passed to a software buffer in 
the digital program. After sampling is complete, program analysis 
attempts to fit the sample points with a high order polynomial. 

Two of the three phases have been satisfied. The current state of the
project does not use a recognition algorithm.





III. INITIAL EXPERIMENTATION


There was a contradiction between the information gathered at SRI 
about relevant frequency ranges and that obtained from Pacific Telephone. 
Consequently, experimentation was begun by wiring two Krohn-Hite filters
in series to create a band-pass filter with a variable range. After
a microphone input and an earphone output were connected, the upper and
lower bounds of the pass band were varied to determine the
comprehensibility of randomly selected words. Several sets of twenty-five random
words were chosen to be read by three speakers, including one female 
speaker. The listener was to wear the headphones and write down each 
word as he heard it. Eight listeners were selected, given no background 
information, and asked to put on the headset, face away from the speaker, 
and write down whatever words they heard. By so doing, no visual aids to 
speech perception were available to the listener (i.e., lip movement). 
Furthermore, care was taken to ensure that the listener could not 
hear anything except what came through the headset. 

The initial frequency range used was 500-1000 Hz as this was the
range of primary interest. It was found that the comprehensibility
of the words that were selected ranged from a low of 17 out of 25
correct to a high of 19 out of 25 correct, with most scores
at 18 out of 25 words. In 100% of the cases, the vowel
sounds were totally perceptible. Also in every case, the words that
were incorrectly transcribed began with th, d, f, and s,
these sounds all sounding somewhat alike to every listener. The next step
was to change the lower bound of the filter to zero in order to discover
any further information that might be available at the lower frequencies.
In looking at the results of these tests, it was determined that no
increase in information was gained. The conclusion was that the lower
bound of 500 Hz was reasonable.

The next frequency range investigated was 1000-2000 Hz, with
somewhat startling results, for there was almost a total loss of word
recognition. This made the frequency range of 500-1000 Hz a necessary 
condition for speech recognition. 

As a check on the primary upper limit of 1000 Hz, the range 500 to
2000 Hz was investigated. This was done to establish an upper
frequency bound on the remaining information. This range proved to be
sufficient, as one hundred percent comprehension from all listeners was
obtained. To further narrow down this critical range, the upper limit
of the band pass was lowered to 1500 Hz. It was found that the same
level of understanding was present. This upper level was lowered to
1400 Hz without any information loss, but below this level the same
difficulties were encountered as in the primary frequency range (i.e.,
500-1000 Hz).

The preceding experiment brought to light a salient point: Human 
beings possess some other faculty for speech understanding besides just 
a complete frequency spectrum analysis. But there are obviously critical 
frequency ranges because all words could not be understood at frequencies 
outside the critical range. 

It should also be noted that obtaining center frequencies for filters
in the range around 1500 Hz is very unreliable due to the inaccuracy
of the hardware. This is so because the CI-5000 was designed to work
efficiently only at frequencies below 1000 Hz.





At this stage of the experimentation the brush recorder indicated that the
original premise, that 500-1000 Hz is both a necessary and a sufficient
condition for speech recognition, was correct. Efforts were concentrated on
looking at the words and sounds which were earlier confused by the
listeners. After several recordings, the fact was established that there
were differences between the difficult to discern words in the upper
frequency ranges (800-1000 Hz).

Based on the results of the experiments, it would be reasonable to
expect the primary frequency range to contain enough information to
make speech recognition possible.





IV. ANALOG SYSTEM 


The input to the analog system is a microphone, the audio output
of which goes through a pre-amplifier and from there is fed, via the
keying circuit, to a bank of eight paralleled band-pass filters. The
output of each filter is connected to a smoothing circuit, and from
there to the channels of the digital sampler, which in turn feeds data
to the digital computer (see figure 2).


A. COMPARATIVE NETWORK 

The comparative network (see figure 3, part A.) acts as a keying 
circuit for the analog system. Its function is to start the analog 
data gathering when a person speaks into the microphone. This was 
necessary in order to minimize the timing problem of speech recognition. 

The diagram shows two inputs to the comparator (C06): one being
the audio input signal and the other a reference signal. By adjusting 
the potentiometer (P), the exciting voltage level can be altered. It 
is normally set just above the noise level so that random noise will not 
accidentally key the circuit. 

The output of the comparator is normally false or zero; when the 
circuit is keyed, even for an instant, delay flip flop zero (DFO) 
changes from false to true for a period of time determined by a dial 
setting. This in turn puts a true signal into T100 (TEST(1) in digital 
program) and interrupt 52 is enabled. 

In order to control the system input, a digital three position switch
(DS0) is employed. As long as the switch is in the middle or ground
position, it acts as a short circuit and prevents T100 from going true.
When placed in either of the two true positions, it acts as an open
circuit and T100 can be enabled. Thus, to key the system, DS0 must be
set to true and the speaker must then excite the circuit.
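
The keying behavior of the comparative network can be mimicked in software. The sketch below is an assumption-laden analogy, not a model of the CI-5000 wiring: the function name `keyed_capture`, the use of the absolute value of the input, and the sample values are all hypothetical.

```python
def keyed_capture(samples, reference, switch_enabled, n_capture):
    """Return n_capture samples starting where the input first exceeds the
    reference level. Returns [] if the switch is grounded or nothing keys."""
    if not switch_enabled:
        return []  # switch in ground position: the test line never goes true
    for start, value in enumerate(samples):
        if abs(value) > reference:  # comparator: input versus reference
            return samples[start:start + n_capture]
    return []


# Low-level noise followed by speech; reference set just above the noise
audio = [0.01, -0.02, 0.01, 0.5, 0.4, -0.3, 0.2]
print(keyed_capture(audio, reference=0.05, switch_enabled=True, n_capture=3))
print(keyed_capture(audio, reference=0.05, switch_enabled=False, n_capture=3))
```

The fixed capture length plays the role of the delay flip flop's dial setting, which holds the test line true for a set period once the circuit is keyed.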


B. BAND-PASS FILTER 

It was necessary to build eight band-pass filters on the CI-5000 
analog computer. They had to be realizable component-wise. Most textbook
filters were realizable, but impractical as eight could not be made with 
the existing hardware. The filter chosen was selected with reluctance 
for although it met the aforementioned requirements, it was a low Q 
or low resolution filter. 

The diagram (figure 4) shows two amplifiers (A1 and A2), two
integrators (I1 and I2) and three potentiometers (P1 thru P3). Potentiometer
one controls the center frequency of the filter, while potentiometers
two and three control the band width. Table one lists the actual
components used and Table two lists both the associated potentiometer
settings and the filter frequency ranges.


C. SMOOTHING CIRCUIT 

A smoothing circuit was incorporated into the system, again, due to
hardware limitations; this will be discussed in detail in the Digital
Program Development section under smoothed data. The output of each of
the filters is fed into a separate smoother and from there to separate
channels of the digital sampler. The function of the circuit is to trace
the envelope of the audio curve.





D. SAMPLING FREQUENCY 
The sampling frequency is controlled by two things: first, the
frequency generator used, and second, the frequency divider (PSET CTR)
(see figure 3, part B.). In order to attain a sample frequency of 200
samples per second, a 10 Kc frequency generator is used in conjunction
with a division by 50, set into the PSET CTR. This generates a pulse
into delay flip flop one (DF1) every five milliseconds. DF1, in turn,
enables interrupt 52 for 0.1 millisecond, during which time a sample is
taken by the eight used channels of the digital sampler simultaneously.
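
The divider arithmetic can be checked with a short calculation. The Python sketch below is illustrative only; the function name `sample_timing` is an assumption and does not appear in the thesis program.

```python
def sample_timing(generator_hz, divisor):
    """Return (sample rate in Hz, sample period in ms) for a frequency
    generator followed by a divide-by-N counter."""
    rate_hz = generator_hz / divisor
    period_ms = 1000.0 / rate_hz
    return rate_hz, period_ms


# 10 Kc generator divided by 50, as quoted in the text
rate, period = sample_timing(10_000, 50)
print(rate, period)  # 200.0 5.0
```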







V. DIGITAL PROGRAM DEVELOPMENT 


The output from the CI 5000 is transferred to the XDS 9300 by 
means of a hardware link between the two machines. When an interrupt 
occurs, control is transferred to the subroutine which handles the buffer 
indexing, and which also calls the system subroutine which loads the 
buffer. The digitized samples from the analog computer are stored in 
the buffer until the complete set of data has been gathered. Once this 
has occurred the interrupt is disabled and the analysis begins. 

An orthogonal least-squares curve-fitting technique is applied to
the data from each of the eight filters, and the resulting polynomial
coefficients are printed. The coefficients are used to compute values
for the dependent variable, which is currently plotted by hand to compare
to brush recordings of the same data.
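
The fit-then-evaluate cycle can be illustrated with a small Python sketch. It is not the thesis routine: it substitutes plain normal equations solved by Gaussian elimination for the orthogonal least-squares technique named above, which is adequate only for low degrees, and the names `polyfit_normal` and `evaluate` as well as the test data are hypothetical.

```python
def polyfit_normal(xs, ys, degree, weights=None):
    """Weighted least-squares polynomial fit via the normal equations.
    Returns coefficients for c0 + c1*x + ... (lowest power first)."""
    if weights is None:
        weights = [1.0] * len(xs)
    n = degree + 1
    # Augmented normal-equation matrix [A | b]
    a = [[sum(w * x ** (i + j) for x, w in zip(xs, weights)) for j in range(n)]
         + [sum(w * y * x ** i for x, y, w in zip(xs, ys, weights))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    # Back substitution
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (a[i][n] - tail) / a[i][i]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


# Made-up data lying exactly on 1 - 2x + 3x^2
xs = [0.1 * i for i in range(11)]
ys = [1.0 - 2.0 * x + 3.0 * x * x for x in xs]
coeffs = polyfit_normal(xs, ys, 2)
print(coeffs)
print(evaluate(coeffs, 0.5))
```

Normal equations become badly conditioned as the degree rises, which is one reason an orthogonal-polynomial formulation is usually preferred for a fit as high as fifteenth degree.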


A. INPUT DATA AVERAGING 

Due to core limitations, which will be discussed in the following
section, there was not sufficient space to store all of the samples
taken if the sampling frequency was high (i.e., around 1000 Hz).
Therefore, an averaging technique was employed. What actually occurred
was simply a temporary buffering of a summation of several consecutive
points before their incorporation into the data set to be used by the
curve-fitting routine. From two to ten points were averaged at various
times. This technique was later found to be unnecessary and too costly
timewise, and was therefore eliminated.
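
The averaging step amounts to replacing each run of consecutive samples with its mean. A minimal Python sketch follows; the function name `block_average`, the decision to drop a trailing partial block, and the sample values are assumptions made for illustration.

```python
def block_average(samples, block_size):
    """Average each run of block_size consecutive samples into one point.
    A trailing partial block is dropped."""
    averaged = []
    for start in range(0, len(samples) - block_size + 1, block_size):
        block = samples[start:start + block_size]
        averaged.append(sum(block) / block_size)
    return averaged


# Seven made-up samples averaged in pairs; the last sample is dropped
print(block_average([1, 3, 2, 4, 10, 20, 7], 2))  # [2.0, 3.0, 15.0]
```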







B. CHANGING SAMPLING FREQUENCY 

An initial, but mistaken, assumption was that samples could be taken
up to and including one sample every millisecond on each filter. Thus,
for each channel one thousand data points could theoretically be obtained
over a period of one second. However, due to limitations of core storage,
a maximum sample size of 500 data points per filter became the upper
limit. This limit could have been extended by the use of overlaying
techniques in the XDS 9300 memory, but these techniques were found to
be too slow to effectively take data at higher rates. The data that
were obtained in using sample frequencies up to 500 samples per second
had large discrepancies. There was an even more severe limitation in
the sampling frequency in that samples could not be taken any faster
than 200 points per second; thus, one sample every five milliseconds.
The problem that existed at higher frequencies was that the buffering
subroutine was too slow, causing a stacking of analog interrupts and
resulting in lost data points.

Now that an upper bound had been established for both the sample
frequency and the sample size, samples could be taken over a total time
interval of two and one half seconds. However, because of the nature of
the previously defined vocabulary, samples need only be taken for one
second or less, with the mainstream of words lasting only one-half to
three-quarters of a second. It was for this reason that the sample data
set normally consisted of one hundred or one hundred and fifty data
points representing one-half or three-quarters of a second respectively.
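
The buffer budget follows directly from the two limits. The Python sketch below reproduces the figures quoted in the text; the constant and function names are hypothetical.

```python
MAX_POINTS_PER_FILTER = 500  # core-storage limit stated in the text
MAX_RATE_HZ = 200            # fastest rate without lost interrupts


def points_needed(duration_s, rate_hz=MAX_RATE_HZ):
    """Number of samples per filter needed to cover duration_s seconds."""
    n = round(duration_s * rate_hz)
    if n > MAX_POINTS_PER_FILTER:
        raise ValueError("duration exceeds the per-filter buffer")
    return n


print(points_needed(0.5))                   # 100
print(points_needed(0.75))                  # 150
print(MAX_POINTS_PER_FILTER / MAX_RATE_HZ)  # 2.5 (longest interval, seconds)
```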







C. RAW VERSUS SMOOTHED DATA 

Early in the research, the data was being fed directly from the
hardware filters to the digital sampler, the resulting data being termed
"raw data." In attempting to look at the representative plots on the
brush recorder, it was discovered that the frequency of the filtered
audio signal was too high for the brush recorder's mechanical recording
arm to follow accurately. In order to alleviate this problem, a
smoothing circuit was constructed external to the analog computer
(figure 5). The function of this circuit was to smooth the data in
such a way as to present the envelope of the original high-frequency
curve. The plotting of this curve was within the mechanical capability
of the brush recorder, and led to the next step in data
manipulation, for it was this smoothed curve that was of interest.
Therefore, instead of the data being fed directly from the analog
filters to the digital sampler, the signal was smoothed first (see
figure 2).

Sampling the higher frequency curve often gave misrepresentative
data, whereas sampling the envelope resulted in much more consistent data.
The curve obtained by sampling the raw data was found to be dependent
upon two factors: (1) the initial point of sampling; and (2) the
sampling frequency used. This was not the case when sampling on the
envelope of the curve, for it was immaterial where the sampling started
or what the interval was; the curve remained almost the same using
recorded input.
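
The dependence on the starting point can be demonstrated numerically. The Python sketch below is an illustration under stated assumptions, not a model of the circuit in figure 5: the envelope is approximated by full-wave rectification followed by a one-pole low-pass, and the 750 Hz tone, the 20 kHz simulation rate, and the 10 ms time constant are all hypothetical choices.

```python
import math


def envelope(signal, dt, tau):
    """Full-wave rectify, then apply a one-pole low-pass with time constant tau."""
    alpha = dt / (tau + dt)
    level = 0.0
    out = []
    for x in signal:
        level += alpha * (abs(x) - level)
        out.append(level)
    return out


dt = 1.0 / 20000  # simulation step (assumed)
tone = [math.sin(2 * math.pi * 750 * n * dt) for n in range(2000)]  # 0.1 s
env = envelope(tone, dt, tau=0.01)

# Take one sample every 5 ms, from two slightly different starting points
step = 100
raw_a, raw_b = tone[1000::step], tone[1007::step]
env_a, env_b = env[1000::step], env[1007::step]

raw_spread = max(abs(a - b) for a, b in zip(raw_a, raw_b))
env_spread = max(abs(a - b) for a, b in zip(env_a, env_b))
print(raw_spread, env_spread)
```

Shifting the starting point by a fraction of a millisecond changes the raw samples substantially, while the envelope samples barely move, which is consistent with the behavior described above.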







D. NOISE 

As was just mentioned, the curves that came from recorded input were 
almost the same. It was this fact that led to the assumption that 
random noise was present in the system. The primary question was just 
how extensively the noise affected the input data. A way to determine this 
was to reduce the keying bias to zero, thereby causing the analog program 
to take data without an exciting voltage. Thus, the only data taken would 
be noise in the system. 

After several data runs of this type, the magnitude of the noise
was found to be approximately one one-thousandth that of the desired
input. It was therefore decided to truncate all information that was
contained at the noise level and retain only three significant digits
from the direct analog input. To ensure that the method was successful,
the initial testing process used in finding the noise was rerun. With
a zero input to the system, all data was successfully truncated to zero.
Furthermore, identical inputs produced more nearly identical outputs.
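
One reading of this truncation is sketched below in Python. It is an interpretation, not the thesis code: the function name `truncate_noise`, the full-scale value of 1.0, and the sample inputs are assumptions, and "three significant digits" is taken to mean three decimal digits relative to full scale.

```python
import math


def truncate_noise(value, full_scale=1.0, digits=3):
    """Keep `digits` decimal digits relative to full_scale, truncating toward
    zero, so anything smaller than full_scale / 10**digits becomes zero."""
    step = full_scale / 10 ** digits
    return math.trunc(value / step) * step


print(truncate_noise(0.4567))   # signal: digits beyond the third are dropped
print(truncate_noise(0.0004))   # noise-level value becomes zero
print(truncate_noise(-0.0007))  # negative noise also becomes zero
```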


E. NORMALIZATION

In attempting to compare two sets of coefficients, it was noticed
that there was often a correlation if a scaling factor was applied
to one of the sets of coefficients. The difference in the size of the
coefficients was possibly due to the change in volume when saying a word
from trial to trial. Consequently, the coefficients would differ from
trial to trial. Thus, an attempt was made to normalize the equations
based on the setting of the high order coefficient to a particular constant,
thereby causing the other coefficients to be scaled.
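
This amplitude normalization can be sketched as follows. The Python below is a minimal illustration; the function name `normalize_by_leading`, the target constant of 1.0, and the coefficient values are hypothetical.

```python
def normalize_by_leading(coeffs, target=1.0):
    """Scale all coefficients so the highest-order one equals target.
    coeffs are ordered lowest power first."""
    leading = coeffs[-1]
    if leading == 0:
        raise ValueError("leading coefficient is zero; cannot normalize")
    scale = target / leading
    return [c * scale for c in coeffs]


quiet = [0.5, -1.0, 2.0]
loud = [1.0, -2.0, 4.0]  # the same curve at twice the amplitude
print(normalize_by_leading(quiet))  # [0.25, -0.5, 1.0]
print(normalize_by_leading(loud))   # [0.25, -0.5, 1.0]
```

Two curves that differ only in volume normalize to the same coefficient set, which is the effect the scaling was meant to achieve.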







This technique gave very promising results for discrete sets of 
trials, but when the intersection of the characteristic subsets was 
taken, the resulting subset was found to be empty, as no correlation 
could be obtained for all data. One of the interesting points that this 
particular method reinforced was the fact that it was much easier to 
attempt correlation with a single speaker than to attempt correlation 
between different speakers. 

It is important to note that the aforementioned normalization is only 
amplitude normalization. The concept of time normalization has not been 
employed, because its importance has been realized only in the most 
recent stages of research. The idea of time normalization will be
treated later in the paper.


F. VARIABLE WEIGHTING FUNCTION 

Initially it was felt that unweighted data would suffice in the 
analysis of a filtered signal. The reasoning was that if sounds could 
be distinguished visually on the brush recorder, then fixed time sampling 
using a ten millisecond time interval would yield satisfactory results. 

Consideration was then given to the idea of equating the weight 
given to a particular data point to the value of the data point. The 
intent of this was to emphasize the larger peaks and deemphasize the 
smaller peaks. By so doing, the curve fitting routine would place 
greater weight on the peaks when calculating coefficients. This was 
also intended to give a zero weight to data points with zero value. 

If the sound being analyzed does not cover the full time interval
that is being sampled, then zero data points appear at the end of the
data set. This causes the curve fitting routine to attempt to fit not
only the non-zero data points, but the zero data points along the x-axis
as well. By requiring the polynomial to fit the zeros, it was believed
that less accurate results would be produced than if the fit were
restricted just to the non-zero data points. The problem was alleviated
by setting the weights of the zero data points equal to zero.

The equating of weights to values neglected the possibility that a
small amplitude segment of the curve might be a significant part of the
curve. Thus, it would be underweighted and underemphasized in the curve
fitting routine; a large amplitude segment that may not be of
significance would be overweighted and overemphasized. Thus, the
coefficients would be out of proportion to the significance of the curve.
Therefore, all except the zero weights were eliminated.
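
The effect of the zero weights can be sketched as follows. For brevity
the sketch fits a straight line in closed form rather than the high
degree polynomial used in the program; the function names and the sample
values are hypothetical.

```python
def zero_one_weights(samples):
    """Weight 0.0 for zero-valued samples, 1.0 for all others."""
    return [0.0 if s == 0 else 1.0 for s in samples]


def weighted_line_fit(xs, ys, ws):
    """Weighted least squares fit of y = intercept + slope * x (closed form)."""
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    intercept = (swy - slope * swx) / sw
    return intercept, slope


# Hypothetical envelope: a rising ramp followed by trailing zeros.
samples = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0]
xs = [float(i) for i in range(len(samples))]

# Equal weights: the trailing zeros pull the fitted line down.
print(weighted_line_fit(xs, samples, [1.0] * len(samples)))
# Zero weights on the zero points: the fit follows the ramp only.
print(weighted_line_fit(xs, samples, zero_one_weights(samples)))
```

With equal weights the fitted slope is negative even though the sound
itself rises; with the zero points weighted out, the fit recovers the
ramp.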


G. TIME SCALING 

The initial interval between data points was arbitrarily chosen to
be one in the curve fitting routine. The resulting coefficients were
out of proportion in that the low degree coefficients were many orders
of magnitude larger than the high degree coefficients. In the comparison
of coefficients of supposedly similar curves, the high order coefficients
are far more important than the low order coefficients. Therefore,
it was necessary to choose a more appropriate interval that would reduce
the disparity in magnitude among the coefficients.

The interval size is inversely proportional to the number of data
points being used. The use of 200 sample points requires an interval of
0.1 units, whereas the use of 100 points requires an interval size of
0.2 units. This size requirement is based on the present state of the
program.
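
The arithmetic behind the interval choice can be sketched as follows.
The span of 20 units is inferred from the two cases quoted above
(200 x 0.1 = 100 x 0.2 = 20); the function names and the sample
coefficients are hypothetical. The sketch also shows why a smaller
interval raises the high degree coefficients relative to the low degree
ones: a coefficient of degree k is divided by the interval raised to the
power k.

```python
def interval_for(n_points, span=20.0):
    """Interval between data points so that n_points cover a fixed span."""
    return span / n_points


def rescale_coefficients(coeffs, interval):
    """Convert coefficients fitted with unit spacing to spacing `interval`.

    If p(t) = sum(b[k] * t**k) with t = 0, 1, 2, ..., then in terms of
    x = interval * t the same curve has coefficients b[k] / interval**k.
    """
    return [b / interval ** k for k, b in enumerate(coeffs)]


def evaluate(coeffs, x):
    """Evaluate a polynomial given lowest-degree-first coefficients."""
    return sum(b * x ** k for k, b in enumerate(coeffs))


print(interval_for(200), interval_for(100))

# Hypothetical unit-spacing coefficients with rapidly shrinking magnitudes.
unit_spacing = [1.0, 1e-2, 1e-4, 1e-6]
print(rescale_coefficients(unit_spacing, interval_for(200)))
```

The rescaled polynomial describes the same curve; only the units of the
x-axis change, which brings the coefficient magnitudes closer together.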







H. SECOND DEGREE SMOOTHING 

Requiring a polynomial to fit a curve with many relative maximums
and minimums, many of which occur within a very short distance, causes
the coefficients to inaccurately represent the envelope of the curve.
By eliminating the minimum points, and keeping only the maximum points,
a second degree smoothing was effected. A copy of the program segment
used to accomplish this can be found at the end of the computer program
section.

This method was discarded under the current program configuration
because it eliminated not only unimportant segments of the curve but
also, under certain circumstances, salient features of the curve.
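
One way to read "keeping only the maximum points" is sketched below. It
assumes that a maximum point is a sample greater than its left neighbor
and not less than its right neighbor; this rule and the function name
are assumptions, and the sketch is not a transcription of the original
program segment.

```python
def keep_local_maxima(samples):
    """Return (index, value) pairs for interior samples that are local maxima."""
    kept = []
    for i in range(1, len(samples) - 1):
        if samples[i - 1] < samples[i] >= samples[i + 1]:
            kept.append((i, samples[i]))
    return kept


# Hypothetical rapidly oscillating signal with a rising envelope.
signal = [0, 3, 1, 4, 2, 5, 0]
print(keep_local_maxima(signal))
```

Only the peaks survive, so the retained points trace the upper envelope.
Any feature that is not itself a peak, such as a shoulder on a rising
edge, is discarded along with the minimum points.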


I. DEGREE OF POLYNOMIAL FIT 

In looking at the brush recordings of some of the words used, it is 
difficult to determine just what degree of polynomial fit is necessary 
to get an accurate representation of the curve in terms of coefficients. 
At first, a twentieth degree fit was used under the assumption that
the larger the degree of the polynomial the better the fit. After
plotting some of the resultant curves, it became obvious that although
a twentieth degree fit was appropriate for some of the curves, it was
too great a degree of fit for others because minor variations in the
curve were emphasized. A tenth degree fit was then tried in order to
give a better average result for all of the curves. This, too, was
inappropriate in that it was too small a degree of fit. The present
program performs a fifteenth degree fit for all curves.







VI. SUMMARY 


Although the current system does not recognize speech, some
combination of the present program and the recommendations made may lead
to a speech recognizer. Two hardware limitations were encountered: it was
impossible to construct eight high resolution filters on the CI 5000, and
there was insufficient direct access core storage in the XDS 9300.
Consequently, low-resolution filters and a small sample size had to be
used. One system software limitation was encountered: the data transfer
subroutine, ADL, was found to be too slow, thus prohibiting high
frequency sampling.

Based on the initial experimentation, and the results obtained thus 
far, it is possible that at least one significant speech parameter is 
being overlooked. Although frequency and formant analysis may be 
necessary, they are not sufficient for a generalized speech recognizer. 

Each word and sound investigated contained a basic wave shape, but
due to pronunciation differences, the shape was altered sufficiently that
coefficient correlation was not effective. Extracting distinctive
portions of the curve that remain the same from trial to trial should
lead to a greater degree of correlation.







VII. RECOMMENDATIONS 


A. CURVE AVERAGING 

Instead of comparing coefficients per se, averaging the input data
points from trial to trial and studying the resulting coefficients
appears to be a promising approach to the problem of speech recognition
using the previously described system. This would entail using overlaying
techniques in the XDS-9300 system.

The main problem associated with this approach is one of timing:
the beginning and end of the curves must coincide to be averaged.
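
A minimal sketch of point-by-point averaging across trials follows. It
assumes each trial is a list of samples already aligned in time; the
function name and the sample values are hypothetical. The length check
reflects the timing requirement noted above.

```python
def average_trials(trials):
    """Average several equal-length trials sample by sample."""
    if not trials:
        raise ValueError("at least one trial is required")
    length = len(trials[0])
    if any(len(t) != length for t in trials):
        raise ValueError("trials must begin and end together to be averaged")
    return [sum(column) / len(trials) for column in zip(*trials)]


# Hypothetical trials of the same word from one speaker.
trial_a = [1.0, 2.0, 3.0]
trial_b = [3.0, 4.0, 5.0]
print(average_trials([trial_a, trial_b]))
```

The averaged curve, rather than each individual trial, would then be
passed to the curve fitting routine.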


B. TIME NORMALIZATION 

The timing problem mentioned in the previous section requires
correction regardless of what other changes are made to the
program. A curve that is stretched over a longer distance bears little
resemblance to the unstretched curve coefficient-wise. For this reason,
any future polynomial curve fitting approach must take into account the
problem of sound duration.
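
One possible form of time normalization is to resample every utterance
to a common number of points before fitting. The sketch below uses
linear interpolation; the technique, the function name, and the sample
values are assumptions offered for illustration and were not part of the
system described here.

```python
def resample(samples, n_out):
    """Resample to n_out points by linear interpolation over the full duration."""
    if len(samples) < 2 or n_out < 2:
        raise ValueError("need at least two input and two output points")
    step = (len(samples) - 1) / (n_out - 1)
    out = []
    for k in range(n_out):
        pos = k * step
        i = min(int(pos), len(samples) - 2)  # left neighbor, clamped at the end
        frac = pos - i
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
    return out


# Hypothetical short and long utterances with the same overall shape.
short = [0.0, 2.0, 4.0]
stretched = [0.0, 1.0, 2.0, 3.0, 4.0]
print(resample(short, 5))
print(resample(stretched, 5))
```

After resampling, the short and the stretched curve occupy the same
number of points, so differences in their coefficients would reflect
shape rather than duration.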


C. SEGMENTED CURVE FITTING 

Throughout the experimentation, it was noticed that although one
particular curve did not totally match another, there were large
segments of the curves that matched quite well, especially in the latter
segments. Thus, instead of one set of coefficients to represent an
audio curve, there might be several representing various curve segments.
Again, time normalization must be considered.
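
The idea of several coefficient sets per curve can be sketched as
follows. For brevity each segment is fitted with a straight line in
closed form rather than a higher degree polynomial, and the segments are
of equal length; both choices, and all names, are assumptions made for
illustration.

```python
def line_fit(ys):
    """Least squares fit of y = intercept + slope * x over x = 0, 1, ..., len(ys) - 1."""
    n = len(ys)
    sx = sum(range(n))
    sxx = sum(x * x for x in range(n))
    sy = sum(ys)
    sxy = sum(x * y for x, y in enumerate(ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


def segmented_fit(samples, n_segments):
    """Split samples into equal-length segments and fit each one separately.

    Samples left over after the last full segment are ignored.
    """
    size = len(samples) // n_segments
    if size < 2:
        raise ValueError("each segment needs at least two points")
    return [line_fit(samples[i * size:(i + 1) * size]) for i in range(n_segments)]


# Hypothetical curve that rises and then falls.
curve = [0.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0, 0.0]
print(segmented_fit(curve, 2))
```

Each segment yields its own (intercept, slope) pair, so two curves can
be compared segment by segment instead of as a whole.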







D. HARDWARE CHANGES 

The bandpass filters used were of relatively low resolution due to
hardware limitations imposed by the CI-5000. In order to have better
filters, it would be necessary to construct them from component parts.
There is strong evidence that this would help to eliminate the harmonics
of voiced audio signals, which cause random variance at different
frequency ranges dependent upon the speaker.


E. SECOND DEGREE ANALOG SMOOTHING 

Although digital second degree smoothing was found to be of no 
practical value, this does not mean that a second degree analog smoothing 
circuit would react in the same manner. Implementing this feature 
could help to alleviate minor differences in audio curves. Thus, a
closer coefficient correlation could be effected.


F. INPUT DATA CORRELATION 

To this point, all recommendations have concerned themselves in
some manner with coefficient correlation. Given a time normalized
curve, it might be interesting to attempt data point correlation of
some form. As was pointed out in the section recommending segmented
curve fitting, there were often parts of the audio curves that compared
quite favorably. By looking only at the associated data points, an
interesting type of correlation might be accomplished.
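
One candidate for data point correlation is the ordinary product-moment
correlation coefficient between two time normalized curves. The sketch
below is an illustration under that assumption; the function name and
the sample values are hypothetical.

```python
import math


def correlation(a, b):
    """Product-moment correlation coefficient of two equal-length sample lists."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("curves must have the same length (at least two points)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant curve")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical curves: the second is the first at twice the volume.
reference = [1.0, 2.0, 3.0, 4.0]
louder = [2.0, 4.0, 6.0, 8.0]
print(correlation(reference, louder))
```

Because the coefficient is unaffected by a uniform change of scale, it
needs no separate amplitude normalization. It can also be applied to
matching segments of two curves rather than to the whole curves.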


G. ORTHOGONAL COEFFICIENT CORRELATION 

The present program outputs coefficients of the form $B_i$, as described
in section II. However, each $B_i$ is dependent upon all of the orthogonal
coefficients, $C_j$. The equation is of the form:

    $B_i = \sum_j C_j \, O_j(x)$

where the $O_j(x)$ are orthogonal polynomials. It is obvious from this
that a change in only one $C_j$ will affect every $B_i$. Therefore, results
could perhaps be attained by investigating the orthogonal coefficients.
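
The dependence of the power-series coefficients B on the orthogonal
coefficients C can be seen in a small sketch. The first three Legendre
polynomials are used here as a stand-in; the orthogonal polynomials
generated by the curve fitting routine depend on the data and weights
and will differ. All names and values are hypothetical.

```python
# Power-series coefficients (lowest degree first) of the first three
# Legendre polynomials, used only as an example orthogonal family.
LEGENDRE = [
    [1.0],             # O_0(x) = 1
    [0.0, 1.0],        # O_1(x) = x
    [-0.5, 0.0, 1.5],  # O_2(x) = (3x^2 - 1) / 2
]


def to_power_coefficients(c):
    """Convert orthogonal coefficients C into power-series coefficients B."""
    if len(c) > len(LEGENDRE):
        raise ValueError("only degrees 0 through 2 are tabulated in this sketch")
    b = [0.0] * len(c)
    for c_j, poly in zip(c, LEGENDRE):
        for power, a in enumerate(poly):
            b[power] += c_j * a  # each C_j contributes to several B entries
    return b


print(to_power_coefficients([1.0, 2.0, 4.0]))
print(to_power_coefficients([1.0, 2.0, 5.0]))  # only the last C changed
```

Changing the single coefficient of O_2 alters more than one entry of B,
whereas the orthogonal coefficients themselves can be compared one at a
time.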







COMPUTER PROGRAM

[Program listing: the FORTRAN statements are not legible in the source
scan. The comment lines that can be recovered are summarized below in
program order.]

Least squares polynomial curve fitting. Given an array of observed
independent variables X and an array of observed dependent variables,
a polynomial in powers of X is fitted.

Quantities defined in the header comments:
    the number of data points
    the degree of fit to be considered
    the array of observed independent variables
    the array of observed dependent variables
    the array of weights
    the output array of estimated dependent variables
    the output array of the differences between the observed and
        estimated dependent variables
    the output array of the coefficients of the powers of X

Main program steps named in the comments:
    Analog to digital connective statement.
    Input desired sample size, not to exceed 200.
    Input desired filter number for printout of input data.
    Input desired interval size.
    Zero out data storage area.
    Test for analog generated interrupt.
    Enable hardware interrupt.
    Wait loop that cycles until analog interrupt occurs.
    Disable hardware interrupt upon completion of data input.
    Test to see if input data printout is desired.
    Pass data from buffer to array of observed dependent variables.
    Set weights to either zero or one.
    Calculate reciprocal of the sum of weights.
    First values of B(1) and B(2) necessary for iteration.
    Dependent variable computation loop: computation of dependent
        variables based on computed coefficients.
    Begin printout of estimated coefficients.
    Test to see if printout of dependent variables is desired for each
        filter.
    Test to see if data from all eight filters have been processed.

Interrupt subroutine:
    One data point from each filter is entered into the buffer.
    Test to see if all data is in the buffer. If true, return to the
        main program, disable the interrupt and begin the analysis.
        If false, return to the endless loop in the main program and
        await the interrupt for the next set of sample points.

Second degree smoothing segment (to be inserted in the main program):
    Count number of data points excluding trailing zeros.
    End of digital smoothing segment.





[Block diagram. Legible labels: MICROPHONE, AMP, ANALOG FILTERS,
SMOOTHING CIRCUIT, COMPARATIVE NETWORK, DIGITAL SAMPLER, XDS 9300,
BRUSH RECORDER; legend: AUDIO SIGNAL.]

FIGURE 1











[Block diagram. Legible labels: MICROPHONE, DIGITAL SAMPLER,
TO XDS-9300.]

FIGURE 2





[Circuit diagram. Legible labels: REFERENCE VOLTAGE, AUDIO SIGNAL,
FREQUENCY GENERATOR.]

FIGURE 3





[Circuit diagram. Legible component value: C = .001 uf.]

FIGURE 4





[Block diagram. Legible labels: SIGNAL, BAND-PASS FILTER, SMOOTHING
CIRCUIT, DIGITAL SAMPLER CHANNEL, 500 (series).]

FIGURE 5





            I1     I2     A1     A2     P1     P2     P3

FILTER 1    A001   A064   A000   A002   P001   P000   P002
FILTER 2    A005   A0??   A010   A006   P005   P004   P006
FILTER 3    A0??   A013   A014   A016   P011   P012   P013
FILTER 4    A015   A017   A022   A02?   P015   P016   P021
FILTER 5    A041   A033   A026   A030   P031   P027   P026
FILTER 6    A041   A047   A034   A036   P037   P036   P044
FILTER 7    A045   A04?   A042   A044   P04?   P044   P042
FILTER 8    A053   A055   A050   A052   P055   P052   P050

(? marks a character that is not legible in the source.)

TABLE 1





                                                  LOWER            UPPER
                                                  3 db    CENTER   3 db
           POTENTIOMETER SETTINGS                 LEVEL   FREQ     LEVEL

FILTER 1   P000 .1885   P001 .1006   P002 .1885    490     505      520
FILTER 2   P004 .1885   P005 .1264   P006 .1885    560     575      590
FILTER 3   P010 .1641   P012 .1885   P013 .1885    630     645      660
FILTER 4   P015 .2015   P016 .1885   P021 .1885    700     715      730
FILTER 5   P026 .1885   P027 .1885   P031 .2429    770     785      800
FILTER 6   P034 .1885   P036 .1885   P037 .2883    840     855      870
FILTER 7   P042 .1885   P044 .1885   P04? .3376    910     925      940
FILTER 8   P050 .1885   P052 .1885   P055 .3904    980     995     1010

P406 .020  biasing pot

all frequencies in hertz (Hz)

TABLE 2








INITIAL DISTRIBUTION LIST 


Defense Documentation Center 
Cameron Station 
Alexandria, Virginia 22314 


Library, Code 0212 
Naval Postgraduate School 
Monterey, California 93940 


LTJG Robert C. Bolles, Code 53Bq 
Department of Mathematics 

Naval Postgraduate School 
Monterey, California 93940 


LT Frederick M. Stubbs 
373-E Bergin Drive 
Monterey, California 93940 


LT John Paul Hydinger 
373-E Bergin Drive 
Monterey, California 93940 




No. Copies 


2 





DOCUMENT CONTROL DATA - R & D

Security Classification: Unclassified

ORIGINATING ACTIVITY (Corporate author)
    Naval Postgraduate School
    Monterey, California 93940

REPORT SECURITY CLASSIFICATION
    Unclassified

REPORT TITLE
    An Investigation Into Machine Recognition of Vowel-like Sounds and
    Their Allophones in One Syllable Words

DESCRIPTIVE NOTES (Type of report and inclusive dates)
    Master's Thesis

AUTHOR(S) (First name, middle initial, last name)
    John Paul Hydinger
    Frederick Michael Stubbs

REPORT DATE
    June 1972

NO. OF REFS
    7

DISTRIBUTION STATEMENT
    Approved for public release; distribution unlimited.

SPONSORING MILITARY ACTIVITY
    Naval Postgraduate School
    Monterey, California 93940

ABSTRACT
    The goal was to recognize sustained vowel-like sounds and their
    allophones in one syllable words. A bank of filters and a digital
    sampler provided a data base for a polynomial curve fitting routine.
    The frequency range under investigation was 500-1000 Hz. A COMCOR
    CI 5000 analog computer and an XDS 9300 digital computer were used.
    Although coefficient correlation was ineffective, several
    recommendations for system improvement are made.

KEY WORDS
    Speech Recognition
    Pattern Recognition
    Vowel Analysis
    Word Analysis





